Pivot Table on Pandas

In this notebook, we focus on the pivot_table feature of pandas.

Import Modules


In [1]:
import pandas as pd
import numpy as np
import plotly.plotly as py
import plotly.graph_objs as go

Import data


In [2]:
df = pd.read_csv('./asset/sydney_housing_market.txt', sep='\t')
df.head()


Out[2]:
suburb council type distance_to_CBD value sold
0 Turrella Rockdale house 9.9 $839,676 17
1 Sydenham Marrickville house 7.4 $845,771 21
2 St Peters Marrickville house 6.9 $920,169 50
3 Tempe Marrickville house 8.3 $921,390 59
4 Waterloo Sydney house 4.7 $937,316 11

Pivot Table

In order to build up a pivot table, we must specify an index.


In [19]:
pd.pivot_table(df, index=['type'])


Out[19]:
distance_to_CBD sold value
type
house 6.585 69.20 1024662.90
unit 6.280 108.55 585739.05

Note that the default aggregation function is np.mean. We can specify the aggregation function in the aggfunc parameter, as shown below.


In [18]:
pd.pivot_table(df, index=['type'], aggfunc={'distance_to_CBD':np.mean, 'sold':np.sum})


Out[18]:
distance_to_CBD sold
type
house 6.585 1384
unit 6.280 2171

For simplicity, we will stick with the default aggregation function.

We also want to see value, but we need to change it into floats first


In [4]:
df['value']=df['value'].replace('[\$,]','',regex=True).astype(float)
pd.pivot_table(df, index=['type'])


Out[4]:
distance_to_CBD sold value
type
house 6.585 69.20 1024662.90
unit 6.280 108.55 585739.05

We could also choose more than one column as index


In [5]:
pd.pivot_table(df, index=['type','council'])


Out[5]:
distance_to_CBD sold value
type council
house Ashfield 7.600000 79.500000 1.101914e+06
Botany Bay 8.200000 29.000000 1.090744e+06
Canada Bay 9.200000 15.000000 1.036064e+06
Canterbury 9.100000 36.000000 1.066214e+06
Leichhardt 5.200000 225.000000 1.076623e+06
Marrickville 6.742857 81.285714 9.933390e+05
Rockdale 9.900000 17.000000 8.396760e+05
Sydney 4.616667 55.666667 1.037789e+06
unit Ashfield 7.233333 109.666667 5.933117e+05
Botany Bay 8.200000 56.000000 4.844870e+05
Burwood 8.400000 43.000000 6.059270e+05
Canterbury 9.100000 15.000000 5.930810e+05
Lane Cove 7.000000 220.000000 6.112070e+05
Marrickville 6.550000 87.333333 5.929137e+05
Randwick 5.200000 53.000000 5.778140e+05
Ryde 7.900000 139.000000 5.729230e+05
Sydney 3.760000 158.400000 5.863850e+05

columns provide an additional way to segment the data


In [6]:
pd.pivot_table(df, index=['council'], columns=['type'])


Out[6]:
distance_to_CBD value sold
type house unit house unit house unit
council
Ashfield 7.600000 7.233333 1.101914e+06 593311.666667 79.500000 109.666667
Botany Bay 8.200000 8.200000 1.090744e+06 484487.000000 29.000000 56.000000
Burwood NaN 8.400000 NaN 605927.000000 NaN 43.000000
Canada Bay 9.200000 NaN 1.036064e+06 NaN 15.000000 NaN
Canterbury 9.100000 9.100000 1.066214e+06 593081.000000 36.000000 15.000000
Lane Cove NaN 7.000000 NaN 611207.000000 NaN 220.000000
Leichhardt 5.200000 NaN 1.076623e+06 NaN 225.000000 NaN
Marrickville 6.742857 6.550000 9.933390e+05 592913.666667 81.285714 87.333333
Randwick NaN 5.200000 NaN 577814.000000 NaN 53.000000
Rockdale 9.900000 NaN 8.396760e+05 NaN 17.000000 NaN
Ryde NaN 7.900000 NaN 572923.000000 NaN 139.000000
Sydney 4.616667 3.760000 1.037789e+06 586385.000000 55.666667 158.400000

Note that NaN implies that there is no data here

The default aggfunc is avg but we could use other functions such as np.sum


In [7]:
pd.pivot_table(df, index=['type'],aggfunc=np.sum)


Out[7]:
distance_to_CBD sold value
type
house 131.7 1384 20493258.0
unit 125.6 2171 11714781.0

Use margins=True to show the total numbers


In [8]:
pd.pivot_table(df, index=['type','council'], aggfunc=np.sum, margins=True)


Out[8]:
distance_to_CBD sold value
type council
house Ashfield 15.2 159.0 2203829.0
Botany Bay 8.2 29.0 1090744.0
Canada Bay 9.2 15.0 1036064.0
Canterbury 9.1 36.0 1066214.0
Leichhardt 5.2 225.0 1076623.0
Marrickville 47.2 569.0 6953373.0
Rockdale 9.9 17.0 839676.0
Sydney 27.7 334.0 6226735.0
unit Ashfield 21.7 329.0 1779935.0
Botany Bay 8.2 56.0 484487.0
Burwood 8.4 43.0 605927.0
Canterbury 9.1 15.0 593081.0
Lane Cove 7.0 220.0 611207.0
Marrickville 39.3 524.0 3557482.0
Randwick 5.2 53.0 577814.0
Ryde 7.9 139.0 572923.0
Sydney 18.8 792.0 2931925.0
All 257.3 3555.0 32208039.0

We should use avg for value but sum for sold, and we do not want to see distance_to_CBD for now


In [9]:
pd.pivot_table(df, index=['council','suburb'], 
               columns=['type'], 
               values=['sold', 'value'], 
               aggfunc={'sold':np.sum, 'value':np.mean}, 
               margins=True)


Out[9]:
value sold
type house unit All house unit All
council suburb
Ashfield Ashfield 1123787.0 575096.00 849441.500 110.0 228.0 338.0
Haberfield NaN 573347.00 573347.000 NaN 13.0 13.0
Summer Hill 1080042.0 631492.00 855767.000 49.0 88.0 137.0
Botany Bay Eastlakes 1090744.0 484487.00 787615.500 29.0 56.0 85.0
Burwood Croydon NaN 605927.00 605927.000 NaN 43.0 43.0
Canada Bay Mortlake 1036064.0 NaN 1036064.000 15.0 NaN 15.0
Canterbury Hurlstone Park 1066214.0 593081.00 829647.500 36.0 15.0 51.0
Lane Cove Lane Cove North NaN 611207.00 611207.000 NaN 220.0 220.0
Leichhardt Leichhardt 1076623.0 NaN 1076623.000 225.0 NaN 225.0
Marrickville Dulwich Hill NaN 622141.00 622141.000 NaN 143.0 143.0
Enmore 1079403.0 607195.00 843299.000 58.0 15.0 73.0
Lewisham NaN 569889.00 569889.000 NaN 20.0 20.0
Marrickville 994026.0 552713.00 773369.500 191.0 225.0 416.0
Petersham 1122372.0 610890.00 866631.000 82.0 69.0 151.0
St Peters 920169.0 NaN 920169.000 50.0 NaN 50.0
Stanmore 1070242.0 594654.00 832448.000 108.0 52.0 160.0
Sydenham 845771.0 NaN 845771.000 21.0 NaN 21.0
Tempe 921390.0 NaN 921390.000 59.0 NaN 59.0
Randwick Centennial Park NaN 577814.00 577814.000 NaN 53.0 53.0
Rockdale Turrella 839676.0 NaN 839676.000 17.0 NaN 17.0
Ryde Gladesville NaN 572923.00 572923.000 NaN 139.0 139.0
Sydney Alexandria NaN 617024.00 617024.000 NaN 211.0 211.0
Chippendale 1060415.0 NaN 1060415.000 31.0 NaN 31.0
Darlington 1062426.0 NaN 1062426.000 32.0 NaN 32.0
Elizabeth Bay NaN 615967.00 615967.000 NaN 182.0 182.0
Erskineville 1005107.0 NaN 1005107.000 84.0 NaN 84.0
Newtown 1044330.0 559175.00 801752.500 166.0 104.0 270.0
Potts Point NaN 565141.00 565141.000 NaN 217.0 217.0
Rushcutters Bay NaN 574618.00 574618.000 NaN 78.0 78.0
Waterloo 937316.0 NaN 937316.000 11.0 NaN 11.0
Zetland 1117141.0 NaN 1117141.000 10.0 NaN 10.0
All 1024662.9 585739.05 805200.975 1384.0 2171.0 3555.0

Advanced Filtering over Pivot Table

We firstly build a pivot table


In [10]:
table = pd.pivot_table(df, index=['council'], columns=['type'], values=['sold'], aggfunc=np.sum, margins=True)
table


Out[10]:
sold
type house unit All
council
Ashfield 159.0 329.0 488.0
Botany Bay 29.0 56.0 85.0
Burwood NaN 43.0 43.0
Canada Bay 15.0 NaN 15.0
Canterbury 36.0 15.0 51.0
Lane Cove NaN 220.0 220.0
Leichhardt 225.0 NaN 225.0
Marrickville 569.0 524.0 1093.0
Randwick NaN 53.0 53.0
Rockdale 17.0 NaN 17.0
Ryde NaN 139.0 139.0
Sydney 334.0 792.0 1126.0
All 1384.0 2171.0 3555.0

We can just look at data from one city


In [11]:
table.query('council==["Randwick"]')


Out[11]:
sold
type house unit All
council
Randwick NaN 53.0 53.0

We can also specify multiple values


In [12]:
table.query('council==["Rockdale","Lane Cove"]')


Out[12]:
sold
type house unit All
council
Lane Cove NaN 220.0 220.0
Rockdale 17.0 NaN 17.0

Note: we can not query a pivot table from the columns. So you should put item in index if you want to query it.

Plot the data using plot.ly

Before you can execute the code below, you need to install plotly, register a free account with them, and create a profile that contains your own API key. See https://plot.ly/python/getting-started/


In [13]:
plot_table = table[:-1]  # get rid of ALL

In [14]:
plot_table.sold.house


Out[14]:
council
Ashfield        159.0
Botany Bay       29.0
Burwood           NaN
Canada Bay       15.0
Canterbury       36.0
Lane Cove         NaN
Leichhardt      225.0
Marrickville    569.0
Randwick          NaN
Rockdale         17.0
Ryde              NaN
Sydney          334.0
Name: house, dtype: float64

In [15]:
table.sold.house


Out[15]:
council
Ashfield         159.0
Botany Bay        29.0
Burwood            NaN
Canada Bay        15.0
Canterbury        36.0
Lane Cove          NaN
Leichhardt       225.0
Marrickville     569.0
Randwick           NaN
Rockdale          17.0
Ryde               NaN
Sydney           334.0
All             1384.0
Name: house, dtype: float64

In [16]:
plot_table.index


Out[16]:
Index(['Ashfield', 'Botany Bay', 'Burwood', 'Canada Bay', 'Canterbury',
       'Lane Cove', 'Leichhardt', 'Marrickville', 'Randwick', 'Rockdale',
       'Ryde', 'Sydney'],
      dtype='object', name='council')

In [17]:
trace1 = go.Bar(
    x=plot_table.index,
    y=plot_table.sold.house,
    name='House'
)
trace2 = go.Bar(
    x=plot_table.index,
    y=plot_table.sold.unit,
    name='Unit'
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='pandas-notebook-plot2')


Out[17]:

Exercise

Perform some analysis that interests you using the Sydney Acution Data at https://auction-results.domain.com.au/Proofed/PDF/Sydney_Domain.pdf


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]: